Decoding Consumer Sentiment: Analyzing Trends in Yelp Reviews
Consumers often use review data to make decisions, such as where to eat and what products to buy. Businesses use it as well, for purposes such as gathering feedback and comparing themselves with their competitors. When using this data, it is important to understand its underlying trends so you can make more accurate decisions based on what you see. I looked at Yelp data spanning 2005 to 2022 to see what trends I could find. I analyzed the sentiment of around 50,000 reviews and found expected insights, such as reviews that registered as more positive tending to be associated with higher star ratings. I also found less intuitive insights: reviews with lower star ratings tended to be more neutral, with lower variability in their tone, while reviews with higher star ratings tended to be more polarized, with higher variability. Finally, I selected a few categories of businesses and explored how their star ratings and tone compared to each other and varied over time.
# # One-time preprocessing (commented out after the csv was saved)
# import tarfile
# # Extract data from tarfile
# with tarfile.open("yelp_dataset.tgz", "r") as all_data:
#     # Extract all members of the archive
#     all_data.extractall(filter="tar")
# # Open and read the review JSON file
# review_rows = []
# with open("yelp_academic_dataset_review.json", "r", encoding="utf-8") as file:
#     for line in file:
#         # Load each line as a separate JSON object (row)
#         row = json.loads(line)
#         review_rows.append(row)
# # Convert the list of rows into a pandas DataFrame
# review = pd.DataFrame(review_rows)
# # Open and read the business JSON file
# business_rows = []
# with open(
#     "yelp_academic_dataset_business.json", "r", encoding="utf-8"
# ) as file:
#     for line in file:
#         # Load each line as a separate JSON object (row)
#         row = json.loads(line)
#         business_rows.append(row)
# # Convert the list of rows into a pandas DataFrame
# business = pd.DataFrame(business_rows)
# # Take a sample of the businesses
# business_sample = business.sample(n=1000, random_state=1)
# # Combine the two datasets
# combined = business_sample.merge(
#     review, how="inner", on="business_id", suffixes=("_business", "_review")
# )
# # Convert the combination into a csv
# combined.to_csv("yelp_combined.csv", index=False)
The full Yelp dataset contained far more rows than was practical to process, making any code take unreasonably long to run. Because of this, I took a random sample of 1,000 businesses from the business dataset and joined it with the review dataset, so that I had all the reviews for each business in my random sample.
# Import libraries
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns
import plotly.express as px
import plotly.io as pio
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from IPython.display import display
from textblob import TextBlob
import ipywidgets as widgets
import scipy.stats as stats
import scikit_posthocs as sp
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.multicomp import pairwise_tukeyhsd
import json
from itertools import combinations
import statsmodels.stats.multitest as smm
%matplotlib inline
pio.templates.default = "plotly_white"
I then did a bit of feature engineering, creating a new variable for the length of each review. Using the TextBlob library, I also calculated each review's polarity (how negative, -1, or positive, +1, the text is) and subjectivity (how factual, 0, or opinionated, 1, the text is). Finally, I created a new variable, Abs_polarity, which measures the strength of the polarity by taking the absolute value of the polarity.
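As a minimal sketch of this feature engineering on a toy DataFrame (two hypothetical reviews, with polarity scores filled in by hand rather than computed by TextBlob):

```python
import pandas as pd

# Two hypothetical reviews with hand-assigned polarity scores
df = pd.DataFrame({
    "text": ["Great service!", "It was okay, nothing special."],
    "Polarity": [0.8, -0.1],
})
df["Review Length"] = df["text"].str.len()  # characters per review
df["Abs_polarity"] = df["Polarity"].abs()   # strength of the polarity
print(df[["Review Length", "Abs_polarity"]])
```

Note that Abs_polarity deliberately discards direction: a strongly negative review and a strongly positive one score the same, which is what makes it a measure of tone strength.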
# Load the preprocessed data
combined = pd.read_csv("yelp_combined.csv", encoding="utf-8")
# # Calculate the length of each review
# combined["Review Length"] = combined["text"].apply(len)
# # Loop through each row and calculate sentiments
# for index, text in combined["text"].items():
#     blob = TextBlob(text)
#     combined.at[index, "Polarity"] = blob.sentiment.polarity
#     combined.at[index, "Subjectivity"] = blob.sentiment.subjectivity
# # Calculate polarity strength
# combined["Abs_polarity"] = abs(combined["Polarity"])
Below are the first five rows of the dataset as well as some preliminary statistics. I also checked for null values. The only variables that had null values were address, attributes, and hours, none of which I use in this analysis.
# Display the first five rows
display(combined.head())
# Show some preliminary statistics
display(combined.describe())
# Check for null values
print(combined.isnull().sum())
| | business_id | name | address | city | state | postal_code | latitude | longitude | stars_business | review_count | ... | funny | cool | text | date | Review Length | Star Sentiment | Polarity | Subjectivity | Abs_polarity | Sentiment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | L_f14MSPdkgHI81mN9--bw | Luca Italian Leather | 100 2nd Ave NE | Saint Petersburg | FL | 33701 | 27.7733 | -82.633806 | 5.0 | 5 | ... | 1 | 1 | This neat store opened not too long ago. They... | 2016-03-03 01:25:38 | 629 | Positive | 0.347708 | 0.678889 | 0.347708 | Positive |
| 1 | L_f14MSPdkgHI81mN9--bw | Luca Italian Leather | 100 2nd Ave NE | Saint Petersburg | FL | 33701 | 27.7733 | -82.633806 | 5.0 | 5 | ... | 0 | 0 | Phenomenal quality and service. Willing to acc... | 2021-03-17 22:47:11 | 85 | Positive | 0.609375 | 0.716667 | 0.609375 | Positive |
| 2 | L_f14MSPdkgHI81mN9--bw | Luca Italian Leather | 100 2nd Ave NE | Saint Petersburg | FL | 33701 | 27.7733 | -82.633806 | 5.0 | 5 | ... | 0 | 0 | Great store, unique high quality Italian leath... | 2020-02-13 18:25:08 | 233 | Positive | 0.390000 | 0.578750 | 0.390000 | Positive |
| 3 | L_f14MSPdkgHI81mN9--bw | Luca Italian Leather | 100 2nd Ave NE | Saint Petersburg | FL | 33701 | 27.7733 | -82.633806 | 5.0 | 5 | ... | 0 | 0 | My husbanded needed shoes four our wedding. Sa... | 2021-02-14 18:53:03 | 312 | Positive | 0.215260 | 0.558157 | 0.215260 | Positive |
| 4 | L_f14MSPdkgHI81mN9--bw | Luca Italian Leather | 100 2nd Ave NE | Saint Petersburg | FL | 33701 | 27.7733 | -82.633806 | 5.0 | 5 | ... | 0 | 0 | Gorgeous leather handbags, jackets and shoes. ... | 2021-03-21 18:32:32 | 155 | Positive | 0.700000 | 0.950000 | 0.700000 | Positive |
5 rows × 28 columns
| | latitude | longitude | stars_business | review_count | is_open | stars_review | useful | funny | cool | Review Length | Polarity | Subjectivity | Abs_polarity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 50363.000000 | 50363.000000 | 50363.000000 | 50363.00000 | 50363.000000 | 50363.000000 | 50363.000000 | 50363.000000 | 50363.000000 | 50363.000000 | 50363.000000 | 50363.000000 | 50363.000000 |
| mean | 35.559602 | -90.478860 | 3.786351 | 277.43657 | 0.817823 | 3.785477 | 1.172508 | 0.325139 | 0.506106 | 570.645573 | 0.248511 | 0.564448 | 0.288057 |
| std | 5.491401 | 15.121928 | 0.751343 | 305.53432 | 0.385994 | 1.465665 | 2.812676 | 1.363873 | 2.050777 | 532.092638 | 0.237985 | 0.134091 | 0.188194 |
| min | 27.688229 | -119.887051 | 1.000000 | 5.00000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 17.000000 | -1.000000 | 0.000000 | 0.000000 |
| 25% | 29.938914 | -90.337143 | 3.500000 | 48.00000 | 1.000000 | 3.000000 | 0.000000 | 0.000000 | 0.000000 | 230.000000 | 0.109670 | 0.484946 | 0.143234 |
| 50% | 36.325741 | -86.188308 | 4.000000 | 152.00000 | 1.000000 | 4.000000 | 0.000000 | 0.000000 | 0.000000 | 409.000000 | 0.254762 | 0.562143 | 0.266667 |
| 75% | 39.910601 | -82.287938 | 4.500000 | 381.00000 | 1.000000 | 5.000000 | 1.000000 | 0.000000 | 0.000000 | 721.000000 | 0.394643 | 0.644444 | 0.400000 |
| max | 53.631919 | -74.700195 | 5.000000 | 1291.00000 | 1.000000 | 5.000000 | 132.000000 | 77.000000 | 131.000000 | 5000.000000 | 1.000000 | 1.000000 | 1.000000 |
business_id          0
name                 0
address           1027
city                 0
state                0
postal_code          0
latitude             0
longitude            0
stars_business       0
review_count         0
is_open              0
attributes        1033
categories           0
hours             2667
review_id            0
user_id              0
stars_review         0
useful               0
funny                0
cool                 0
text                 0
date                 0
Review Length        0
Star Sentiment       0
Polarity             0
Subjectivity         0
Abs_polarity         0
Sentiment            0
dtype: int64
To get a sense of the variables I was working with, I first used bar graphs and histograms to examine their distributions. As you can see, the most common star rating given by a review was 5; 4 was next, closely followed by 1. Far fewer reviews gave a rating of 2 or 3. As one might expect, the distribution of a business's average star rating looks more normal. It's centered at 4, with more extreme values being less common. Given that 5 is the highest possible rating, the distribution is not symmetric, since the right tail is cut off.
# Bar Plot of the distribution of stars
fig, axes = plt.subplots(1, 2, figsize=(12, 6))
sns.set_style("white")
sns.countplot(data=combined, x="stars_review", color="darkviolet", ax=axes[0])
axes[0].set_title("Distribution of Review Star Ratings")
axes[0].set_xlabel("Review Star Rating")
axes[0].set_ylabel("Count")
axes[0].grid(True, axis="y")
sns.countplot(
data=combined, x="stars_business", color="darkviolet", ax=axes[1]
)
axes[1].set_title("Distribution of Business Star Ratings")
axes[1].set_xlabel("Business Star Rating")
axes[1].set_ylabel("Count")
axes[1].grid(True, axis="y")
plt.tight_layout()
plt.show()
The distributions of review polarity and subjectivity also appear approximately normal. Polarity is relatively symmetric and centered around 0.25; subjectivity is relatively symmetric and centered around 0.55. Both have centers slightly above the midpoint of their respective ranges: polarity trends slightly positive, and subjectivity trends slightly more opinionated than factual. Absolute polarity looks different. The most common polarity strength is around 0.225, but there is a clear right skew to the data, with low absolute polarities more common than high ones. This means that reviews tended to be more balanced in tone rather than strictly positive or negative.
# Histogram of the distribution of Polarity and Subjectivity
fig, axes = plt.subplots(1, 3, figsize=(21, 7))
sns.set_style("white")
sns.histplot(
data=combined, x="Polarity", color="darkviolet", kde=True, ax=axes[0]
)
axes[0].set_title("Distribution of Review Polarity")
axes[0].set_xlabel("Review Polarity")
axes[0].set_ylabel("Count")
axes[0].grid(True, axis="y")
sns.histplot(
data=combined, x="Subjectivity", color="darkviolet", kde=True, ax=axes[1]
)
axes[1].set_title("Distribution of Review Subjectivity")
axes[1].set_xlabel("Review Subjectivity")
axes[1].set_ylabel("Count")
axes[1].grid(True, axis="y")
sns.histplot(
data=combined, x="Abs_polarity", color="darkviolet", kde=True, ax=axes[2]
)
axes[2].set_title("Distribution of Absolute Polarity")
axes[2].set_xlabel("Absolute Polarity")
axes[2].set_ylabel("Count")
axes[2].grid(True, axis="y")
plt.tight_layout()
plt.show()
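The right skew in absolute polarity also makes sense mechanically: taking an absolute value folds a distribution about zero, pushing mass toward the right tail. A stylized sketch with simulated values (not the Yelp data itself):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated polarity scores, roughly symmetric about 0 and clipped to [-1, 1]
polarity = rng.normal(loc=0.0, scale=0.25, size=10_000).clip(-1, 1)
# Folding about zero produces a right-skewed "absolute polarity"
abs_polarity = np.abs(polarity)
print(f"skew(polarity)     = {stats.skew(polarity):.2f}")
print(f"skew(abs_polarity) = {stats.skew(abs_polarity):.2f}")  # positive => right skew
```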
Each business came with many different categories, some of which were not very distinct (e.g., Restaurants vs. Food). To keep the insights digestible, I narrowed these down to six categories that I thought were both common and distinct from each other. Below is a bar graph of the count of each category. Although the "Restaurants" category has by far the most data points at 35,093, the other categories are still well represented, with the least frequent, "Hotels & Travel", having 1,601.
# Convert the 'categories' column into a list
combined["categories"] = combined["categories"].str.split(",")
# Clean up each category name (strip whitespace, title case)
combined["categories"] = combined["categories"].apply(
    lambda x: [category.strip().title() for category in x]
)
categories_to_filter = [
"Restaurants",
"Event Planning & Services",
"Shopping",
"Beauty & Spas",
"Arts & Entertainment",
"Hotels & Travel",
]
combined_filtered = combined[
combined["categories"].apply(
lambda x: any(cat in x for cat in categories_to_filter)
)
]
combined_filtered_exploded = combined_filtered.explode("categories")
combined_filtered_exploded = combined_filtered_exploded[
combined_filtered_exploded["categories"].isin(categories_to_filter)
].reset_index(drop=True)
category_order = combined_filtered_exploded["categories"].value_counts().index
# display(combined_filtered_exploded["categories"].value_counts())
# Create barplot of business categories
sns.countplot(
data=combined_filtered_exploded,
y="categories",
order=category_order,
color="darkviolet",
zorder=10,
)
plt.title("Distribution of Main Business Categories")
plt.ylabel("Category")
plt.xlabel("Count")
plt.grid(axis="x", zorder=0, lw=0.5)
Although the star rating is meant to express a review's overall sentiment, the content of a review doesn't always fully match it. The graphs below explore the relationships between star rating and the polarity, subjectivity, and polarity strength of the reviews. The overall trend between polarity and star rating is as expected: a higher star rating is associated with a higher polarity. It is interesting to note, however, that even for a star rating of 1, the average polarity is only slightly below 0, indicating only a vaguely negative tone.

The relationship between star rating and subjectivity is less pronounced, but still positive: a higher star rating correlates with more subjectivity. That isn't the only pattern, though. The spread of subjectivities is greater for the extreme star ratings than for the middle ones. In other words, extreme star ratings were associated with both higher and lower subjectivities, while the middle star ratings tended to be more consistent.

Finally, the relationship between star rating and polarity strength (absolute polarity) is also positive. This is somewhat unexpected, because it means lower star ratings tended to be more neutral than higher ones, when you might expect the middle star ratings to be the most neutral. It is, however, consistent with the first graph, which showed the lower star ratings' polarities centered around 0 and moving upward (one direction of "more extreme") from there. Additionally, the distribution of polarity strength is more concentrated for star ratings of 1 and 2 and more spread out for the higher star ratings. This indicates that for lower star ratings, polarity strength is more consistent, while for higher star ratings it is more variable.
# Boxplots of Review Star Ratings with Sentiments
fig, axes = plt.subplots(1, 3, figsize=(21, 7))
# Plot for Abs_polarity
sns.boxplot(
data=combined,
x="stars_review",
y="Abs_polarity",
color="white",
fliersize=1,
linecolor="black",
ax=axes[2],
zorder=0,
)
sns.violinplot(
data=combined,
x="stars_review",
y="Abs_polarity",
color="darkviolet",
ax=axes[2],
alpha=0.25,
inner=None,
zorder=10,
)
axes[2].set_title("Absolute Polarity vs. Star Rating")
axes[2].set_xlabel("Star Rating")
axes[2].set_ylabel("Absolute Polarity")
# Plot for Polarity
sns.boxplot(
data=combined,
x="stars_review",
y="Polarity",
color="white",
fliersize=1,
linecolor="black",
ax=axes[0],
zorder=0,
)
sns.violinplot(
data=combined,
x="stars_review",
y="Polarity",
color="darkviolet",
ax=axes[0],
alpha=0.25,
inner=None,
zorder=10,
)
axes[0].set_title("Polarity vs. Star Rating")
axes[0].set_xlabel("Star Rating")
# Plot for Subjectivity
sns.boxplot(
data=combined,
x="stars_review",
y="Subjectivity",
color="white",
fliersize=1,
linecolor="black",
ax=axes[1],
zorder=0,
)
sns.violinplot(
data=combined,
x="stars_review",
y="Subjectivity",
color="darkviolet",
ax=axes[1],
alpha=0.25,
inner=None,
zorder=10,
)
axes[1].set_title("Subjectivity vs. Star Rating")
axes[1].set_xlabel("Star Rating")
plt.tight_layout()
plt.show()
# Boxplots of Business Categories with Sentiments
fig, axes = plt.subplots(1, 3, figsize=(21, 7))
# Plot for Abs_polarity
sns.boxplot(
data=combined_filtered_exploded,
y="categories",
x="Abs_polarity",
order=category_order,
color="white",
fliersize=1,
linecolor="black",
ax=axes[2],
zorder=0,
)
sns.violinplot(
data=combined_filtered_exploded,
y="categories",
x="Abs_polarity",
order=category_order,
color="darkviolet",
ax=axes[2],
alpha=0.25,
inner=None,
zorder=10,
)
axes[2].set_title("Absolute Polarity vs. Business Category")
axes[2].set_ylabel("Business Category")
axes[2].set_xlabel("Absolute Polarity")
# Plot for Polarity
sns.boxplot(
data=combined_filtered_exploded,
y="categories",
x="Polarity",
order=category_order,
color="white",
fliersize=1,
linecolor="black",
ax=axes[0],
zorder=0,
)
sns.violinplot(
data=combined_filtered_exploded,
y="categories",
x="Polarity",
order=category_order,
color="darkviolet",
ax=axes[0],
alpha=0.25,
inner=None,
zorder=10,
)
axes[0].set_title("Polarity vs. Business Category")
axes[0].set_ylabel("Business Category")
# Plot for Subjectivity
sns.boxplot(
data=combined_filtered_exploded,
y="categories",
x="Subjectivity",
order=category_order,
color="white",
fliersize=1,
linecolor="black",
ax=axes[1],
zorder=0,
)
sns.violinplot(
data=combined_filtered_exploded,
y="categories",
x="Subjectivity",
order=category_order,
color="darkviolet",
ax=axes[1],
alpha=0.25,
inner=None,
zorder=10,
)
axes[1].set_title("Subjectivity vs. Business Category")
axes[1].set_ylabel("Business Category")
plt.tight_layout()
plt.show()
# Test ANOVA Assumptions for Polarity
model1 = smf.ols(
"Polarity ~ C(categories)", data=combined_filtered_exploded
).fit()
model2 = smf.ols(
"Subjectivity ~ C(categories)", data=combined_filtered_exploded
).fit()
model3 = smf.ols(
"Abs_polarity ~ C(categories)", data=combined_filtered_exploded
).fit()
residuals1 = model1.resid
fitted1 = model1.fittedvalues
residuals2 = model2.resid
fitted2 = model2.fittedvalues
residuals3 = model3.resid
fitted3 = model3.fittedvalues
fig, axes = plt.subplots(3, 2, figsize=(12, 12))
sns.residplot(
x=fitted1,
y=residuals1,
lowess=True,
line_kws={"color": "red"},
ax=axes[0, 0],
)
axes[0, 0].set_xlabel("Fitted Values")
axes[0, 0].set_ylabel("Residuals")
axes[0, 0].set_title("Residuals vs. Fitted Values")
sm.qqplot(residuals1, line="s", ax=axes[0, 1])
axes[0, 1].set_title("QQ Plot of Residuals")
sns.residplot(
x=fitted2,
y=residuals2,
lowess=True,
line_kws={"color": "red"},
ax=axes[1, 0],
)
axes[1, 0].set_xlabel("Fitted Values")
axes[1, 0].set_ylabel("Residuals")
axes[1, 0].set_title("Residuals vs. Fitted Values")
sm.qqplot(residuals2, line="s", ax=axes[1, 1])
axes[1, 1].set_title("QQ Plot of Residuals")
sns.residplot(
x=fitted3,
y=residuals3,
lowess=True,
line_kws={"color": "red"},
ax=axes[2, 0],
)
axes[2, 0].set_xlabel("Fitted Values")
axes[2, 0].set_ylabel("Residuals")
axes[2, 0].set_title("Residuals vs. Fitted Values")
sm.qqplot(residuals3, line="s", ax=axes[2, 1])
axes[2, 1].set_title("QQ Plot of Residuals")
plt.suptitle("ANOVA Assumptions", fontsize=16)
fig.text(
0.5, 0.925, "Polarity vs. Business Category", ha="center", fontsize=14
)
fig.text(
0.5, 0.62, "Subjectivity vs. Business Category", ha="center", fontsize=14
)
fig.text(
0.5,
0.31,
"Absolute Polarity vs. Business Category",
ha="center",
fontsize=14,
)
plt.tight_layout(h_pad=3.5)
fig.subplots_adjust(top=0.9)
plt.show()
# Kruskal-Wallis Test for polarity vs. business category
restaurants = combined_filtered_exploded[
combined_filtered_exploded["categories"] == "Restaurants"
]["Polarity"]
events = combined_filtered_exploded[
combined_filtered_exploded["categories"] == "Event Planning & Services"
]["Polarity"]
shopping = combined_filtered_exploded[
combined_filtered_exploded["categories"] == "Shopping"
]["Polarity"]
beauty = combined_filtered_exploded[
combined_filtered_exploded["categories"] == "Beauty & Spas"
]["Polarity"]
art = combined_filtered_exploded[
combined_filtered_exploded["categories"] == "Arts & Entertainment"
]["Polarity"]
travel = combined_filtered_exploded[
combined_filtered_exploded["categories"] == "Hotels & Travel"
]["Polarity"]
result = stats.kruskal(restaurants, events, shopping, beauty, art, travel)
dunn = sp.posthoc_dunn(
combined_filtered_exploded,
val_col="Polarity",
group_col="categories",
p_adjust="bonferroni",
)
print(
"Kruskal Wallis & Dunn's Test Results for Polarity vs. Business Category"
)
print(f"Kruskal-Wallis Statistic: {result.statistic:.2f}")
print(f"P-Value: {result.pvalue:.2e}")
# display(dunn.round(4))
display(dunn < 0.05)
Kruskal Wallis & Dunn's Test Results for Polarity vs. Business Category
Kruskal-Wallis Statistic: 159.01
P-Value: 1.61e-32
| Arts & Entertainment | Beauty & Spas | Event Planning & Services | Hotels & Travel | Restaurants | Shopping | |
|---|---|---|---|---|---|---|
| Arts & Entertainment | False | True | True | False | True | False |
| Beauty & Spas | True | False | False | True | False | True |
| Event Planning & Services | True | False | False | True | False | True |
| Hotels & Travel | False | True | True | False | True | False |
| Restaurants | True | False | False | True | False | True |
| Shopping | False | True | True | False | True | False |
# Kruskal-Wallis Test for subjectivity vs. business category
restaurants = combined_filtered_exploded[
combined_filtered_exploded["categories"] == "Restaurants"
]["Subjectivity"]
events = combined_filtered_exploded[
combined_filtered_exploded["categories"] == "Event Planning & Services"
]["Subjectivity"]
shopping = combined_filtered_exploded[
combined_filtered_exploded["categories"] == "Shopping"
]["Subjectivity"]
beauty = combined_filtered_exploded[
combined_filtered_exploded["categories"] == "Beauty & Spas"
]["Subjectivity"]
art = combined_filtered_exploded[
combined_filtered_exploded["categories"] == "Arts & Entertainment"
]["Subjectivity"]
travel = combined_filtered_exploded[
combined_filtered_exploded["categories"] == "Hotels & Travel"
]["Subjectivity"]
result = stats.kruskal(restaurants, events, shopping, beauty, art, travel)
dunn = sp.posthoc_dunn(
combined_filtered_exploded,
val_col="Subjectivity",
group_col="categories",
p_adjust="bonferroni",
)
print(
"Kruskal Wallis & Dunn's Test Results for Subjectivity vs. Business Category"
)
print(f"Kruskal-Wallis Statistic: {result.statistic:.2f}")
print(f"P-Value: {result.pvalue:.2e}")
# display(dunn.round(4))
display(dunn < 0.05)
Kruskal Wallis & Dunn's Test Results for Subjectivity vs. Business Category
Kruskal-Wallis Statistic: 362.05
P-Value: 4.46e-76
| Arts & Entertainment | Beauty & Spas | Event Planning & Services | Hotels & Travel | Restaurants | Shopping | |
|---|---|---|---|---|---|---|
| Arts & Entertainment | False | False | True | False | True | False |
| Beauty & Spas | False | False | True | False | True | True |
| Event Planning & Services | True | True | False | True | False | True |
| Hotels & Travel | False | False | True | False | True | False |
| Restaurants | True | True | False | True | False | True |
| Shopping | False | True | True | False | True | False |
# Kruskal-Wallis Test for absolute polarity vs. business category
restaurants = combined_filtered_exploded[
combined_filtered_exploded["categories"] == "Restaurants"
]["Abs_polarity"]
events = combined_filtered_exploded[
combined_filtered_exploded["categories"] == "Event Planning & Services"
]["Abs_polarity"]
shopping = combined_filtered_exploded[
combined_filtered_exploded["categories"] == "Shopping"
]["Abs_polarity"]
beauty = combined_filtered_exploded[
combined_filtered_exploded["categories"] == "Beauty & Spas"
]["Abs_polarity"]
art = combined_filtered_exploded[
combined_filtered_exploded["categories"] == "Arts & Entertainment"
]["Abs_polarity"]
travel = combined_filtered_exploded[
combined_filtered_exploded["categories"] == "Hotels & Travel"
]["Abs_polarity"]
result = stats.kruskal(restaurants, events, shopping, beauty, art, travel)
dunn = sp.posthoc_dunn(
combined_filtered_exploded,
val_col="Abs_polarity",
group_col="categories",
p_adjust="bonferroni",
)
print(
"Kruskal Wallis & Dunn's Test Results for Absolute Polarity vs. Business Category"
)
print(f"Kruskal-Wallis Statistic: {result.statistic:.2f}")
print(f"P-Value: {result.pvalue:.2e}")
# display(dunn.round(4))
display(dunn < 0.05)
Kruskal Wallis & Dunn's Test Results for Absolute Polarity vs. Business Category
Kruskal-Wallis Statistic: 170.14
P-Value: 6.80e-35
| Arts & Entertainment | Beauty & Spas | Event Planning & Services | Hotels & Travel | Restaurants | Shopping | |
|---|---|---|---|---|---|---|
| Arts & Entertainment | False | True | True | False | True | False |
| Beauty & Spas | True | False | False | True | False | True |
| Event Planning & Services | True | False | False | True | False | True |
| Hotels & Travel | False | True | True | False | True | False |
| Restaurants | True | False | False | True | False | True |
| Shopping | False | True | True | False | True | False |
combined_filtered_exploded["stars_review"] = combined_filtered_exploded[
    "stars_review"
].astype(int)
combined_filtered_exploded["stars_business_round"] = np.floor(
    combined_filtered_exploded["stars_business"]
).astype(int)
# Order categories by their proportion of 5-star ratings (descending)
sorted_categories_business = (
    combined_filtered_exploded.groupby("categories")["stars_business_round"]
    .value_counts(normalize=True)
    .xs(5, level="stars_business_round")
    .sort_values(ascending=False)
    .index.tolist()
)
sorted_categories_review = (
    combined_filtered_exploded.groupby("categories")["stars_review"]
    .value_counts(normalize=True)
    .xs(5, level="stars_review")
    .sort_values(ascending=False)
    .index.tolist()
)
combined_filtered_exploded["stars_review"] = pd.Categorical(
combined_filtered_exploded["stars_review"]
)
combined_filtered_exploded["stars_business_round"] = pd.Categorical(
combined_filtered_exploded["stars_business_round"]
)
# Create a figure and subplots
fig, axes = plt.subplots(1, 2, figsize=(15, 6))
# Stacked Bar Chart of Business Categories with Business Star Ratings
combined_filtered_exploded["categories"] = pd.Categorical(
combined_filtered_exploded["categories"],
categories=sorted_categories_business,
ordered=True,
)
sns.histplot(
data=combined_filtered_exploded,
y="categories",
hue="stars_business_round",
multiple="fill",
stat="proportion",
palette=sns.color_palette("hls", 5),
hue_order=combined_filtered_exploded[
"stars_business_round"
].cat.categories[::-1],
discrete=True,
shrink=0.8,
ax=axes[0],
)
axes[0].set_title("Proportion of Business Star Ratings by Business Category")
axes[0].set_ylabel("Business Category")
axes[0].set_xlabel("Proportion")
sns.move_legend(
axes[0],
"upper center",
bbox_to_anchor=(0.5, -0.15),
ncol=5,
title="Star Rating",
reverse=True,
)
# Stacked Bar Chart of Business Categories with Review Star Ratings
combined_filtered_exploded["categories"] = pd.Categorical(
combined_filtered_exploded["categories"],
sorted_categories_review,
ordered=True,
)
sns.histplot(
data=combined_filtered_exploded,
y="categories",
hue="stars_review",
multiple="fill",
stat="proportion",
palette=sns.color_palette("hls", 5),
hue_order=combined_filtered_exploded["stars_review"].cat.categories[::-1],
discrete=True,
shrink=0.8,
ax=axes[1],
)
axes[1].set_title("Proportion of Review Star Ratings by Business Category")
axes[1].set_ylabel("Business Category")
axes[1].set_xlabel("Proportion")
sns.move_legend(
axes[1],
"upper center",
bbox_to_anchor=(0.5, -0.15),
ncol=5,
title="Star Rating",
reverse=True,
)
# Adjust layout and show the combined plot
plt.tight_layout()
plt.show()
Note that for this comparison I had to round the business star ratings down to whole numbers. I also added 0.5 to every cell of the contingency tables so that no cell count is zero.
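A minimal sketch of that 0.5 adjustment on a toy table (hypothetical counts, not the Yelp data):

```python
import numpy as np
from scipy import stats

# Toy contingency table: rows are rounded star ratings, columns are two
# hypothetical business categories; note the zero cell in the first row.
table = np.array([[12.0, 0.0], [30.0, 25.0], [58.0, 40.0]])
table += 0.5  # add 0.5 to every cell so no observed count is zero
chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")
```

Adding a small constant to every cell shrinks the test statistic slightly, so it makes the test a bit more conservative rather than inflating significance.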
combined_filtered_exploded["categories"] = pd.Categorical(
combined_filtered_exploded["categories"],
category_order,
ordered=True,
)
# Chi-square
contingency_table = pd.crosstab(
combined_filtered_exploded["stars_business_round"],
combined_filtered_exploded["categories"],
)
contingency_table += 0.5
# Overall Chi-square
chi2, p, dof, expected = stats.chi2_contingency(contingency_table)
print("Chi-Square Tests for Business Star Ratings by Category")
print("Chi-Square Statistic:", chi2.round(2))
print("P-Value:", p)
# Perform Chi-Square test for each pair of categories
categories = contingency_table.columns
p_values = []
comparisons = []
for cat1, cat2 in combinations(categories, 2):
sub_table = contingency_table[[cat1, cat2]]
chi2, p, _, _ = stats.chi2_contingency(sub_table)
p_values.append(p)
comparisons.append(f"{cat1} vs {cat2}")
# Adjust p-values using Bonferroni correction
_, p_adjusted, _, _ = smm.multipletests(p_values, method="bonferroni")
# Create a DataFrame for the results
post_hoc_results = pd.DataFrame(
{
"Comparison": comparisons,
"Adjusted P-Value": p_adjusted,
}
)
# Add significance column
post_hoc_results["Significant"] = post_hoc_results["Adjusted P-Value"] < 0.05
post_hoc_results = post_hoc_results.set_index("Comparison")
post_hoc_results.index.name = None
display(post_hoc_results)
Chi-Square Tests for Business Star Ratings by Category
Chi-Square Statistic: 5772.58
P-Value: 0.0
| Adjusted P-Value | Significant | |
|---|---|---|
| Restaurants vs Event Planning & Services | 0.000000e+00 | True |
| Restaurants vs Shopping | 0.000000e+00 | True |
| Restaurants vs Beauty & Spas | 0.000000e+00 | True |
| Restaurants vs Arts & Entertainment | 4.753070e-59 | True |
| Restaurants vs Hotels & Travel | 0.000000e+00 | True |
| Event Planning & Services vs Shopping | 1.839501e-93 | True |
| Event Planning & Services vs Beauty & Spas | 6.533687e-41 | True |
| Event Planning & Services vs Arts & Entertainment | 1.876549e-87 | True |
| Event Planning & Services vs Hotels & Travel | 8.423155e-54 | True |
| Shopping vs Beauty & Spas | 1.556177e-28 | True |
| Shopping vs Arts & Entertainment | 4.248370e-57 | True |
| Shopping vs Hotels & Travel | 2.592184e-31 | True |
| Beauty & Spas vs Arts & Entertainment | 3.539321e-76 | True |
| Beauty & Spas vs Hotels & Travel | 1.637500e-24 | True |
| Arts & Entertainment vs Hotels & Travel | 5.797439e-99 | True |
# Chi-square
contingency_table = pd.crosstab(
combined_filtered_exploded["stars_review"],
combined_filtered_exploded["categories"],
)
contingency_table += 0.5
# Overall Chi-square
chi2, p, dof, expected = stats.chi2_contingency(contingency_table)
print("Chi-Square Tests for Review Star Ratings by Category")
print("Chi-Square Statistic:", chi2.round(2))
print("P-Value:", p)
# Perform Chi-Square test for each pair of categories
categories = contingency_table.columns
p_values = []
comparisons = []
for cat1, cat2 in combinations(categories, 2):
sub_table = contingency_table[[cat1, cat2]]
chi2, p, _, _ = stats.chi2_contingency(sub_table)
p_values.append(p)
comparisons.append(f"{cat1} vs {cat2}")
# Adjust p-values using Bonferroni correction
_, p_adjusted, _, _ = smm.multipletests(p_values, method="bonferroni")
# Create a DataFrame for the results
post_hoc_results = pd.DataFrame(
{
"Comparison": comparisons,
"Adjusted P-Value": p_adjusted,
}
)
# Add significance column
post_hoc_results["Significant"] = post_hoc_results["Adjusted P-Value"] < 0.05
post_hoc_results = post_hoc_results.set_index("Comparison")
post_hoc_results.index.name = None
display(post_hoc_results)
Chi-Square Tests for Review Star Ratings by Category
Chi-Square Statistic: 1322.49
P-Value: 4.5204657190659196e-268
| Adjusted P-Value | Significant | |
|---|---|---|
| Restaurants vs Event Planning & Services | 1.026970e-38 | True |
| Restaurants vs Shopping | 1.665348e-88 | True |
| Restaurants vs Beauty & Spas | 1.160864e-140 | True |
| Restaurants vs Arts & Entertainment | 8.086487e-01 | False |
| Restaurants vs Hotels & Travel | 3.864245e-45 | True |
| Event Planning & Services vs Shopping | 1.246295e-23 | True |
| Event Planning & Services vs Beauty & Spas | 1.014245e-40 | True |
| Event Planning & Services vs Arts & Entertainment | 2.167515e-16 | True |
| Event Planning & Services vs Hotels & Travel | 1.708936e-16 | True |
| Shopping vs Beauty & Spas | 8.806005e-23 | True |
| Shopping vs Arts & Entertainment | 3.255638e-32 | True |
| Shopping vs Hotels & Travel | 4.090113e-02 | True |
| Beauty & Spas vs Arts & Entertainment | 1.097079e-77 | True |
| Beauty & Spas vs Hotels & Travel | 5.170178e-12 | True |
| Arts & Entertainment vs Hotels & Travel | 4.089392e-29 | True |
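The Bonferroni adjustment that `smm.multipletests` applies above is simple enough to reproduce by hand: each raw p-value is multiplied by the number of comparisons, then capped at 1.0. A minimal sketch (the example p-values are made up for illustration):

```python
def bonferroni(p_values):
    """Bonferroni-adjust a list of raw p-values."""
    m = len(p_values)
    # Multiply each p-value by the number of tests, capping at 1.0
    return [min(p * m, 1.0) for p in p_values]

raw = [0.001, 0.02, 0.04, 0.30]
adjusted = bonferroni(raw)
print(adjusted)
```

With 15 pairwise category comparisons, a raw p-value must fall below roughly 0.05 / 15 ≈ 0.0033 to remain significant after adjustment, which is why the correction is considered conservative.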
combined["date"] = pd.to_datetime(combined["date"], errors="coerce")
combined_filtered_exploded["date"] = pd.to_datetime(
combined_filtered_exploded["date"], errors="coerce"
)
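The `errors="coerce"` flag matters here: any review date that fails to parse becomes `NaT` instead of raising an exception, so malformed rows can be filtered out later rather than crashing the conversion. A small standalone illustration (the sample strings are made up):

```python
import pandas as pd

# Unparseable entries become NaT rather than raising a ValueError
dates = pd.to_datetime(pd.Series(["2019-06-01", "not a date"]), errors="coerce")
print(dates.isna().tolist())  # [False, True]
```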
# Optionally restrict to a window before 2020, e.g. the five years
# leading up to COVID:
# cutoff = pd.to_datetime("2020-01-01")
# combined_shorter = combined_filtered_exploded[
#     (combined_filtered_exploded["date"] > cutoff - pd.DateOffset(years=5))
#     & (combined_filtered_exploded["date"] < cutoff)
# ]
combined_filtered_exploded["stars_review"] = pd.to_numeric(
combined_filtered_exploded["stars_review"]
)
fig = px.scatter(
combined_filtered_exploded,
x="date",
y="stars_review",
trendline="lowess",
title="Review Star Ratings Over Time",
color="categories",
category_orders={"categories": category_order},
)
# Loop through fig.data and adjust the legend visibility
for trace in fig.data:
if trace.mode == "markers": # Scatter points
trace.showlegend = False # Do not show in the legend
elif trace.mode == "lines": # Trendline (LOWESS)
trace.showlegend = True # Show the trendline in the legend
fig.data = fig.data[1::2]  # Keep only the trendline traces (every second trace)
fig.update_layout(
xaxis_title="Date",
yaxis_title="Star Ratings",
font_family="Times New Roman",
font_color="black",
legend_title="Category",
showlegend=True,
legend=dict(
x=1.35, # Horizontal position of the legend
y=0.5, # Vertical position of the legend
xanchor="right", # Anchor the legend horizontally to the right
yanchor="middle", # Anchor the legend vertically to the center
),
)
fig.show(renderer="notebook")
fig1 = px.scatter(
combined_filtered_exploded,
x="date",
y="Polarity",
trendline="lowess",
title="Review Polarity Over Time",
color="categories",
category_orders={"categories": category_order},
)
# Loop through fig.data and adjust the legend visibility
for trace in fig1.data:
if trace.mode == "markers": # Scatter points
trace.showlegend = False # Do not show in the legend
elif trace.mode == "lines": # Trendline (LOWESS)
trace.showlegend = True # Show the trendline in the legend
fig1.data = fig1.data[1::2]  # Keep only the trendline traces (every second trace)
fig1.update_layout(
xaxis_title="Date",
yaxis_title="Polarity",
font_family="Times New Roman",
font_color="black",
legend_title="Category",
showlegend=True,
legend=dict(
x=1.35, # Horizontal position of the legend
y=0, # Vertical position of the legend
xanchor="right", # Anchor the legend horizontally to the right
        yanchor="bottom",  # Anchor the legend vertically to the bottom
),
)
fig2 = px.scatter(
combined_filtered_exploded,
x="date",
y="Subjectivity",
trendline="lowess",
title="Review Subjectivity Over Time",
color="categories",
category_orders={"categories": category_order},
)
fig2.data = fig2.data[1::2]  # Keep only the trendline traces (every second trace)
fig2.update_layout(
xaxis_title="Date",
yaxis_title="Subjectivity",
font_family="Times New Roman",
font_color="black",
legend_title="Category",
)
fig3 = px.scatter(
combined_filtered_exploded,
x="date",
y="Abs_polarity",
trendline="lowess",
title="Review Absolute Polarity Over Time",
color="categories",
category_orders={"categories": category_order},
)
fig3.data = fig3.data[1::2]  # Keep only the trendline traces (every second trace)
fig3.update_layout(
xaxis_title="Date",
yaxis_title="Absolute Polarity",
font_family="Times New Roman",
font_color="black",
)
# Create subplots with two columns
fig = make_subplots(
rows=2,
cols=2,
subplot_titles=(
"Review Polarity Over Time",
"Review Subjectivity Over Time",
"Review Absolute Polarity Over Time",
), # Titles for each subplot
)
# Add the traces from fig1 to the first subplot (column 1)
for trace in fig1.data:
fig.add_trace(trace, row=1, col=1)
# Add the traces from fig2 to the second subplot (column 2)
for trace in fig2.data:
fig.add_trace(trace, row=1, col=2)
for trace in fig3.data:
fig.add_trace(trace, row=2, col=1)
fig.show(renderer="notebook")